Our offer for you
Mission & Vision
As a Data Engineer (m/f/d), you will support the development of our legal data system by designing and maintaining a robust data infrastructure. You will be responsible for building and optimizing ETL pipelines that process legal data from multiple jurisdictions, while developing data models that ensure consistency, scalability, and accuracy across diverse datasets. A key part of your role will be implementing metadata enrichment strategies that enhance the searchability and usability of legal information. You will also conduct database performance benchmarking and tuning to guarantee efficient query execution and long-term scalability. Working closely with product teams, researchers, and legal domain experts, you will deliver high-quality, reliable data solutions that help us unlock the value of complex, multilingual legal content.
Your Team
You will join our AI Team led by Felix (Head of AI), working closely with a group of approximately 5 AI experts. This highly collaborative team focuses on pushing the boundaries of generative AI, natural language processing, and privacy-preserving machine learning for legal solutions.
Your Hiring Manager
Felix, our Head of AI, will guide you through your journey at Noxtua. With deep expertise in AI systems, Felix leads with a passion for innovation and a collaborative approach, ensuring every team member thrives.
Benefits
- Working hours: Flexible; full-time or part-time
- Vacation: 26 days + December 24th & 31st off
- Remote: 100% remote work possible (given a European residence), other countries upon request
- Discounts: e.g. Urban Sports Club Membership
- Equipment: Laptop (Lenovo or Mac), second screen, keyboard, etc.
Your responsibilities
- Build and optimize ETL pipelines to process legal data from multiple jurisdictions, including chunking, embedding, and ingesting legal data.
- Develop and maintain data models that ensure consistency, scalability, and accuracy across diverse datasets and large amounts of data.
- Coordinate data handovers from different sources.
- Implement metadata enrichment strategies to enhance searchability and usability of legal information.
- Experiment with embedding strategies and train embedding models, including their evaluation.
- Conduct database performance benchmarking and tuning to ensure efficient query execution and scalability.
- Collaborate with product, AI, and legal domain experts to deliver high-quality, reliable data solutions.
Our Tech Stack
- Programming Languages: Python
- Data formats: XML, Parquet
- Storage: Blob storage systems like S3 (especially OTC OBS)
- Frameworks: LangChain, LangGraph
- Vector Search: Elasticsearch, Qdrant, Pinecone
- Graph Databases: Neo4j, Amazon Neptune, TigerGraph
- Libraries: Hugging Face Transformers, NumPy, pandas, Pydantic, FastAPI, OpenAI, PyTorch
- Deployment Tools: Docker
- Cloud Infrastructure: OTC, AWS, GCP, Azure
- Pipeline Orchestration: Apache Airflow, Dagster, Prefect
- Ticket System: Atlassian JIRA
- Repository: GitHub
- CI/CD System: GitHub Actions
- Documentation: Confluence
- Communication: Slack
- Office Application: MS365
About you
Requirements:
- Residence & Work Permit: Eligible to work in Germany or within the EU.
- Language: English proficiency at C2 level.
- Experience: AI development or data engineering, with successfully deployed projects
- Data: Expertise in data processing, filtering, and augmentation
- Databases: Expertise in vector databases, data embedding, benchmarking, and management
- Programming: Strong Python skills and experience with AI pipelines
Optional:
- Experience in deploying graph databases
- RAG Systems: Experience building AI-specific RAG pipelines
- NLP & Generative AI: Familiarity with developing and deploying NLP and generative AI models
- Familiarity with Kubernetes deployments
- Legal background knowledge
Sounds good?
Then, we look forward to receiving your CV via our online application form.